NorGramBank: A 'Deep' Treebank for Norwegian
Abstract
We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide-coverage grammar NorGram. One part consisting of 350,000 words has been manually disambiguated using computer-generated discriminants. A larger part of 50 M words has been stochastically disambiguated. The treebank is dynamic: global reparsing at certain intervals keeps it compatible with the latest versions of the grammar and the lexicon, which are continually developed in interaction with the annotators. A powerful query language, INESS Search, has been developed for search across formalisms in the INESS treebanks, including LFG c- and f-structures. Evaluation shows that the grammar provides about 85% of randomly selected sentences with good analyses. Agreement among the annotators responsible for manual disambiguation is satisfactory, but also suggests desirable simplifications of the grammar.
Similar resources
Exploring Treebanks with INESS Search
We demonstrate the current state of INESS, the Infrastructure for the Exploration of Syntax and Semantics. INESS is making treebanks more accessible to the R&D community. Recent work includes the hosting of more treebanks, now covering more than fifty languages. Special attention is paid to NorGramBank, a large treebank for Norwegian, and to the inclusion of the Universal Dependency treebanks, ...
The Norwegian Dependency Treebank
The Norwegian Dependency Treebank is a new syntactic treebank for Norwegian Bokmål and Nynorsk with manual syntactic and morphological annotation, developed at the National Library of Norway in collaboration with the University of Oslo. It is the first publicly available treebank for Norwegian. This paper presents the core principles behind the syntactic annotation and how these principles we...
Universal Dependencies for Norwegian
This article describes the conversion of the Norwegian Dependency Treebank to the Universal Dependencies scheme. This paper details the mapping of PoS tags, morphological features and dependency relations and provides a description of the structural changes made to NDT analyses in order to make it compliant with the UD guidelines. We further present PoS tagging and dependency parsing experiment...
Identifying complex phenomena in a corpus via a treebank lens
While syntactically annotated corpora known as treebanks have been available for many years, along with a variety of customized tools for querying these annotations, the mapping from actual annotations to relevant syntactic or semantic phenomena has been obscured by the coarse-grained labelling of nodes in the parse trees which make up the treebanks. This lack of linguistic detail has hampered ...
Joint UD Parsing of Norwegian Bokmål and Nynorsk
This paper investigates interactions in parser performance for the two official standards for written Norwegian: Bokmål and Nynorsk. We demonstrate that while applying models across standards yields poor performance, combining the training data for both standards yields better results than previously achieved for each of them in isolation. This has immediate practical value for processing Norwe...